Whats Covered

Aditional Resources


Mutating Joins


Welcome to the course!

  • Advantages to dplyr over base R merge function for joing data
    • dplyr always preserves the row order
    • dplyr has much more intuitive syntax
    • dplyr can be applied to databases, or spark
  • dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
    • This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table

Keys

  • The Primary key needs to be unique in a table
  • The foreign key in the second table can be duplicated
  • second table will be matched to the primary table based on the primary key
  • The primary key may be one, two or even more columns in the table

Primary Keys

  • What is the primary key in artists dataset?
    • name
artists0 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
name instrument
Jimmy Buffett Guitar
George Harrison Guitar
Mick Jagger Vocals
Tom Jones Vocals
Davy Jones Vocals
John Lennon Guitar
Paul McCartney Bass
Jimmy Page Guitar
Joe Perry Guitar
Elvis Presley Vocals

Secondary Keys

  • Which datasets have a primary key that matches artists\(name? + `bands\)name+songs$writer`
albums0 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
name band year
A Hard Day’s Night The Beatles 1964
Magical Mystery Tour The Beatles 1967
Beggar’s Banquet The Rolling Stones 1968
Abbey Road The Beatles 1969
Led Zeppelin IV Led Zeppelin 1971
The Dark Side of the Moon Pink Floyd 1973
Aerosmith Aerosmith 1973
Rumours Fleetwood Mac 1977
Hotel California Eagles 1982
bands0 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
name band
John Bonham Led Zeppelin
John Paul Jones Led Zeppelin
Jimmy Page Led Zeppelin
Robert Plant Led Zeppelin
George Harrison The Beatles
John Lennon The Beatles
Paul McCartney The Beatles
Ringo Starr The Beatles
Jimmy Buffett The Coral Reefers
Mick Jagger The Rolling Stones
songs0 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album writer
Come Together Abbey Road John Lennon
Dream On Aerosmith Steven Tyler
Hello, Goodbye Magical Mystery Tour Paul McCartney
It’s Not Unusual Along Came Jones Tom Jones

Multi-variable keys

  • What is the primary key in artists?
    • The combination of first and last
artists %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument
Jimmy Buffett Guitar
George Harrison Guitar
Mick Jagger Vocals
Tom Jones Vocals
Davy Jones Vocals
John Lennon Guitar
Paul McCartney Bass
Jimmy Page Guitar
Joe Perry Guitar
Elvis Presley Vocals

The combination of first and last uniquely identifies each person in the dataset, just as the combination of first and last names tend to uniquely identify people in real life.

Joins

  • Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references

A basic join

As Garrett mentioned in the video, left_join() is the basic join function in dplyr. You can use it whenever you want to augment a data frame with information from another data frame.

For example, left_join(x, y) joins y to x. The second dataset you specify is joined to the first dataset.

# Complete the code to join artists to bands
bands2 <- left_join(bands, artists, by = c('first','last'))

# Examine the results
bands2 %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last band instrument
John Bonham Led Zeppelin NA
John Paul Jones Led Zeppelin NA
Jimmy Page Led Zeppelin Guitar
Robert Plant Led Zeppelin NA
George Harrison The Beatles Guitar
John Lennon The Beatles Guitar
Paul McCartney The Beatles Bass
Ringo Starr The Beatles Drums
Jimmy Buffett The Coral Reefers Guitar
Mick Jagger The Rolling Stones Vocals
Keith Richards The Rolling Stones Guitar
Charlie Watts The Rolling Stones NA
Ronnie Wood The Rolling Stones NA

A second join

You should always check the output of your joins. Just because a join worked, doesn’t mean that it worked as you expected.

For example, the code in the editor joins the same two datasets that you joined in the previous exercise, but it returns a different result. Can you tell what went wrong?

left_join(bands, artists, by = "first") %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last.x band last.y instrument
John Bonham Led Zeppelin Lennon Guitar
John Paul Jones Led Zeppelin NA NA
Jimmy Page Led Zeppelin Buffett Guitar
Jimmy Page Led Zeppelin Page Guitar
Robert Plant Led Zeppelin NA NA
George Harrison The Beatles Harrison Guitar
John Lennon The Beatles Lennon Guitar
Paul McCartney The Beatles McCartney Bass
Paul McCartney The Beatles Simon Guitar
Ringo Starr The Beatles Starr Drums
Jimmy Buffett The Coral Reefers Buffett Guitar
Jimmy Buffett The Coral Reefers Page Guitar
Mick Jagger The Rolling Stones Jagger Vocals
Keith Richards The Rolling Stones Richards Guitar
Charlie Watts The Rolling Stones NA NA
Ronnie Wood The Rolling Stones NA NA
# Fix the code to recreate bands3
left_join(bands, artists, by = c("first","last")) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last band instrument
John Bonham Led Zeppelin NA
John Paul Jones Led Zeppelin NA
Jimmy Page Led Zeppelin Guitar
Robert Plant Led Zeppelin NA
George Harrison The Beatles Guitar
John Lennon The Beatles Guitar
Paul McCartney The Beatles Bass
Ringo Starr The Beatles Drums
Jimmy Buffett The Coral Reefers Guitar
Mick Jagger The Rolling Stones Vocals
Keith Richards The Rolling Stones Guitar
Charlie Watts The Rolling Stones NA
Ronnie Wood The Rolling Stones NA
  • Notice how the last column is pulled in as last.x and last.y if not included in the join
    • And how all combination of rows are created for each name that is duplicated, e.g. john
  • The easiest mistake when joining datasets is to use an incorrect combination of keys.

A right join

There is more than one way to execute a left join. Knowing multiple methods will make you a more versatile data scientist, especially as you try to fit joins into pipes created with %>%.

In this exercise, you’ll recreate bands2 once more, but this time without using a left_join().

# Finish the code below to recreate bands3 with a right join
bands2 <- left_join(bands, artists, by = c("first", "last"))
bands3 <- right_join(artists, bands, by = c("first", "last"))

# Check that bands3 is equal to bands2
setequal(bands2, bands3)
## [1] TRUE

left_join() is equivalent to right_join() with the order of the datasets reversed. Notice that each used the same by argument.

Variations on joins

  • left_join - prioritizes left dataset
  • right_join - prioritizes right dataset
  • inner_join - only retains rows in both datasets
  • full_join - retains all rows
  • Use %>% (pipes) to string together these joins

Inner joins and full joins

You may have noticed that some of the songs in songs correspond to some of the albums in albums. Suppose you want a new dataset that contains all of the songs for which you have data from both albums and songs. How would you make it?

The artists and bandsdatasets also share some information. What if you want to join these two datasets in such a way that you retain all of the information available in both tables, without throwing anything away?

You can think of inner joins as the most strict type of join: they only retain observations that appear in both datasets. In contrast, full joins are the most permissive type of join: they return all of the data that appears in both datasets (often resulting in many missing values).

Recall that, *_join(x, y) joins y to x. The second dataset you specify is joined to the first dataset.

# Join albums to songs using inner_join()
songs %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album first last
Come Together Abbey Road John Lennon
Dream On Aerosmith Steven Tyler
Hello, Goodbye Magical Mystery Tour Paul McCartney
It’s Not Unusual Along Came Jones Tom Jones
albums %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
album band year
A Hard Day’s Night The Beatles 1964
Magical Mystery Tour The Beatles 1967
Beggar’s Banquet The Rolling Stones 1968
Abbey Road The Beatles 1969
Led Zeppelin IV Led Zeppelin 1971
The Dark Side of the Moon Pink Floyd 1973
inner_join(songs, albums, by = "album") %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
song album first last band year
Come Together Abbey Road John Lennon The Beatles 1969
Dream On Aerosmith Steven Tyler Aerosmith 1973
Hello, Goodbye Magical Mystery Tour Paul McCartney The Beatles 1967
# Join bands to artists using full_join()
artists %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument
Jimmy Buffett Guitar
George Harrison Guitar
Mick Jagger Vocals
Tom Jones Vocals
Davy Jones Vocals
John Lennon Guitar
bands %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last band
John Bonham Led Zeppelin
John Paul Jones Led Zeppelin
Jimmy Page Led Zeppelin
Robert Plant Led Zeppelin
George Harrison The Beatles
John Lennon The Beatles
full_join(artists, bands, by = c("first","last")) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument band
Jimmy Buffett Guitar The Coral Reefers
George Harrison Guitar The Beatles
Mick Jagger Vocals The Rolling Stones
Tom Jones Vocals NA
Davy Jones Vocals NA
John Lennon Guitar The Beatles
Paul McCartney Bass The Beatles
Jimmy Page Guitar Led Zeppelin
Joe Perry Guitar NA
Elvis Presley Vocals NA
Keith Richards Guitar The Rolling Stones
Paul Simon Guitar NA
Ringo Starr Drums The Beatles
Joe Walsh Guitar NA
Brian Wilson Vocals NA
Nancy Wilson Vocals NA
John Bonham NA Led Zeppelin
John Paul Jones NA Led Zeppelin
Robert Plant NA Led Zeppelin
Charlie Watts NA The Rolling Stones
Ronnie Wood NA The Rolling Stones

Pipes

You can combine dplyr functions together with the pipe operator, %>%, to build up an analysis step-by-step. %>% takes the result of the code that comes before it and “pipes” it into the function that comes after it as the first argument of the function.

So for example, the two pieces of code below do the same thing:

full_join(artists, bands, 
          by = c("first", "last")) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")

artists %>% 
  full_join(bands, by = c("first", "last"))

Pipes are so efficient for multi-step analysis that you will use them for the remainder of the exercises in this course.

# Find guitarists in bands dataset (don't change)
temp <- left_join(bands, artists, by = c("first", "last"))
temp <- filter(temp, instrument == "Guitar")
select(temp, first, last, band)
## # A tibble: 5 x 3
##   first  last     band              
##   <chr>  <chr>    <chr>             
## 1 Jimmy  Page     Led Zeppelin      
## 2 George Harrison The Beatles       
## 3 John   Lennon   The Beatles       
## 4 Jimmy  Buffett  The Coral Reefers 
## 5 Keith  Richards The Rolling Stones
# Reproduce code above using pipes
bands %>% 
  left_join(artists, by = c("first", "last")) %>%
  filter(instrument == "Guitar") %>%
  select(first, last, band)
## # A tibble: 5 x 3
##   first  last     band              
##   <chr>  <chr>    <chr>             
## 1 Jimmy  Page     Led Zeppelin      
## 2 George Harrison The Beatles       
## 3 John   Lennon   The Beatles       
## 4 Jimmy  Buffett  The Coral Reefers 
## 5 Keith  Richards The Rolling Stones

Practice with pipes and joins

We’ve created a data frame for you called goal. Your mission, if you choose to accept it, is to create a dataset that’s identical to goal.

# Examine the contents of the goal dataset
goal %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument band song album
Tom Jones Vocals NA It’s Not Unusual Along Came Jones
John Lennon Guitar The Beatles Come Together Abbey Road
Paul McCartney Bass The Beatles Hello, Goodbye Magical Mystery Tour
# Create goal2 using full_join() and inner_join() 
goal2 <- artists %>%
  full_join(bands, by = c("first","last")) %>%
  inner_join(songs, by = c("first","last"))
  
  
# Check that goal and goal2 are the same
setequal(goal, goal2)
## [1] TRUE

Remember that the order of the joins in your pipe determines the output of the pipe.

Choose your joins

You’re getting the hang of pipes now! They are a very useful way to combine multiple joins to make a single dataset.

# Create one table that combines all information
artists %>%
  full_join(bands, by = c("first","last")) %>%
  full_join(songs, by = c("first","last")) %>%
  full_join(albums, by = c("album", "band")) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
first last instrument band song album year
Jimmy Buffett Guitar The Coral Reefers NA NA NA
George Harrison Guitar The Beatles NA NA NA
Mick Jagger Vocals The Rolling Stones NA NA NA
Tom Jones Vocals NA It’s Not Unusual Along Came Jones NA
Davy Jones Vocals NA NA NA NA
John Lennon Guitar The Beatles Come Together Abbey Road 1969
Paul McCartney Bass The Beatles Hello, Goodbye Magical Mystery Tour 1967
Jimmy Page Guitar Led Zeppelin NA NA NA
Joe Perry Guitar NA NA NA NA
Elvis Presley Vocals NA NA NA NA
Keith Richards Guitar The Rolling Stones NA NA NA
Paul Simon Guitar NA NA NA NA
Ringo Starr Drums The Beatles NA NA NA
Joe Walsh Guitar NA NA NA NA
Brian Wilson Vocals NA NA NA NA
Nancy Wilson Vocals NA NA NA NA
John Bonham NA Led Zeppelin NA NA NA
John Paul Jones NA Led Zeppelin NA NA NA
Robert Plant NA Led Zeppelin NA NA NA
Charlie Watts NA The Rolling Stones NA NA NA
Ronnie Wood NA The Rolling Stones NA NA NA
Steven Tyler NA NA Dream On Aerosmith NA
NA NA NA The Beatles NA A Hard Day’s Night 1964
NA NA NA The Rolling Stones NA Beggar’s Banquet 1968
NA NA NA Led Zeppelin NA Led Zeppelin IV 1971
NA NA NA Pink Floyd NA The Dark Side of the Moon 1973
NA NA NA Aerosmith NA Aerosmith 1973
NA NA NA Fleetwood Mac NA Rumours 1977
NA NA NA Eagles NA Hotel California 1982

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0  kableExtra_1.0.1
## [5] knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       purrr_0.3.0       magrittr_1.5     
## [34] codetools_0.2-15  scales_1.0.0      htmltools_0.3.6  
## [37] assertthat_0.2.0  rvest_0.3.2       colorspace_1.4-0 
## [40] utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1   
## [43] munsell_0.5.0     crayon_1.3.4